Abstract: Early and intermediate vision algorithms, such as smoothing
and discontinuity detection, are often implemented on general-purpose se- 
rial, and, more recently, parallel computers. The excessive time required by 
these general-purpose computers prevents real-time computation of these 
vision algorithms. Special-purpose hardware implementations of low-level 
vision algorithms may be needed to achieve real-time processing. 

This memo reviews and analyzes some hardware implementations of low- 
level vision algorithms. Two types of hardware implementations are con-
sidered: the digital signal processing chips of Ruetz (and Brodersen) and
the analog VLSI circuits of Carver Mead. Both these approaches claim to 
achieve real-time image processing; both have limited the vision problem 
that they solved in ways largely inconsistent with vision processing in un- 
restricted environments. The advantages and disadvantages of these two 
approaches for producing a general, real-time vision system are considered. 
As early attempts at comprehensive vision hardware, these two approaches 
provide useful insights for future developments of vision hardware. 
1 Introduction

The purpose of this paper is to compare two approaches to special-purpose 
hardware for vision: the analog VLSI approach of Carver Mead[1] at Cal-
tech and the digital VLSI approach championed by Ruetz[2] at Berkeley.
These two researchers have adopted fundamentally different views on the 
implementation of vision algorithms in hardware. This paper will provide 
an overview of their techniques, assumptions, perceived motivation and phi- 
losophy. These issues have important consequences for future developments 
of vision hardware, including the recent M.I.T.[3] proposal. 

The fundamental problem of machine vision is to recognize objects and 
to navigate through an environment by processing of camera images. This 
problem of machine vision is typically broken into three levels: early vision, 
intermediate vision, and recognition [4]. These three levels are all computa- 
tionally intensive. Among these three levels, early and intermediate vision 
algorithms have similar computational requirements. Early and interme- 
diate vision are charged with taking the input data, at camera rate, and 
producing a lower complexity, symbolic representation of the scene. The 
early vision level processes the input images to determine surface properties 
in the 3-dimensional scene. Typical surface properties are: depth, motion, 
color or albedo, and texture. The task of intermediate vision is to com- 
pute the discontinuities in the surface properties provided by early vision. 
The discontinuities mark abrupt changes in surface properties and usually 
correspond to object boundaries. 

By comparison, the recognition level is computationally intensive be- 
cause of the combinatorics of recognition. Recognition uses the object 
boundaries provided by intermediate vision to identify the objects. These 
object boundaries may be symbolic representations, a "feature," such as a 
line (modeled by, say, position, length, angle, and strength). For typical 
scene features, recognition database sizes, and model features, the possi- 
ble combinations quickly become overwhelming. Recognition algorithms are 
drastically different from the generally pixel-based algorithms of early and
intermediate vision. For this reason, this paper will not deal with the prob- 
lems of hardware for recognition. Rather the focus will be on pixel-based 
algorithms for early and intermediate vision. 

The two levels, early and intermediate vision, share some low-level vision
algorithms such as smoothing and discontinuity detection. For many years
these algorithms were implemented on general-purpose serial, and, more re- 
cently, parallel computers. The use of a general-purpose computer facilitates 
modification to the algorithms as research objectives change. A drawback
of such computers has been the excessive time required for the computa- 
tions. On a serial computer, even the relatively primitive operation of edge 
detection can take minutes or, on a parallel computer, seconds. Worse still, 
a sophisticated algorithm for smoothing surface property data while preserv-
ing discontinuities[5] can take minutes on a parallel computer such as the
Connection Machine. Neither of these speeds approaches the camera frame
rate. The need for vision hardware derives from the difficult computational
requirements during the early stages of vision processing due to the large 
data rate. 

Both the approaches of Mead and Ruetz claim to perform real-time image 
processing. As discussed later, both have limited the vision problem that 
they solved in ways largely inconsistent with a vision system for general 
environments. The approach of Ruetz limits the vision problem to two- 
dimensional, motionless images with constraints on the image backgrounds. 
Mead's approach is limited in one regard by its photosensor resolution which 
mandates coarse image analysis and, consequently, is probably unusable for 
recognition tasks. 

Although limited, these two approaches, as early attempts at compre- 
hensive vision hardware, provide useful insights for future developments of 
vision hardware. For example, the trade-offs between local processing and 
photodetector density in Mead's approach must be addressed when contem- 
plating vision hardware. Such decisions have important consequences in 
design time, circuit modularity, and optical properties. 

The reasons for the differences in these two approaches to vision hard-
ware stem largely from philosophical differences and goals. Mead appears
to be more interested in using VLSI to explore biological implementations. 
His "silicon retina" is one example where the circuit design is driven by 
the biological design. Ruetz takes an engineering point of view in which a 
functioning, noiseless device is produced even if it poorly approximates the 
ultimate problem to be solved. 

The remainder of this paper is organized into four sections. The first sec-
tion supplies an overview of the vision problem and outlines various possible
algorithms for vision hardware. The second and third sections are devoted
to the two hardware approaches. Each section details the vision problem
solved, the background information regarding the hardware, and the ad-
vantages and limitations of the hardware as implemented. The philosophy
and goals driving the research for these two approaches are also discussed.
The final section provides a more direct comparison between the two ap- 
proaches and includes suggestions for future developments as a convergence 
of techniques. 



2 Vision Algorithm Primitives 

In this analysis of vision algorithm primitives the emphasis is on the early 
and intermediate stages of vision processing. The primary outputs for this 
processing are the discontinuities in the surface properties and, to a lesser 
extent, the surface properties themselves. These outputs would be sub- 
sequently processed by a recognition system to identify objects in the 3- 
dimensional scene. This recognition process will not be discussed here. 



2.1 Edge and Discontinuity Detection 

Discontinuity detection is basically a generalization of the problem of 
edge detection. Figure 1 provides an overview of early vision and discon- 
tinuity detection. This figure shows the 3-D scene composed of M objects. 
All the points on each object have a surface property vector

\[ X_i = \left\{ \begin{array}{l} \text{Position} \\ \text{Texture Class} \\ \text{Velocity} \\ \text{Surface Color} \end{array} \right\} \]

where X identifies the imaged point (i.e. pixel) and i is the object label.
The 3-D scene is imaged by one or more optical systems, at repeated times,
to yield a set of images, I(x,y), distinguished by the time of measurement
and the position of the optical system. The task of early vision is to use
the set of images to determine this surface property vector at a subset of
the image pixels. For example, a stereo algorithm produces the position, a
motion algorithm determines the velocity, and a texture algorithm classifies
the texture of a pixel. The surface property vector is then used by inter-
mediate vision to ascertain the boundaries for each object i, (i ∈ M), and
consequently to group the image points X_i comprising each object i.

Figure 1: An overview of the early and intermediate vision tasks

Imaging technology is such that the images, I, are spatially sampled by
the imaging device. The imaging devices respond to incident light inten- 
sity and are typically arrays of photodetectors like charge-coupled-devices 
(CCDs) producing charge or phototransistors (PTs) producing current. The 
photodetector arrays can be linear or rectangular, hexagonal[1], or even
"foveal"[6]. The output from the imaging device can be either continuous 
in time or discrete. CCDs are clocked devices that gather charge to produce 
a discrete-time signal. PTs produce a continuous-time signal. Both signals 
have analog magnitudes. Once the image has been detected by the imaging 
device, the signal processing begins at each pixel. 

Edge detection entails finding those locations in the image where the 
incident light intensity varies rapidly in space. This is performed by finding 
the maximum in the gradient or the zeros in the second derivative. The 
problem of differentiation is difficult because of the presence of noise in the 
image signal. The noise is reduced by filtering or, equivalently, smoothing; 
however, smoothing has the undesirable characteristic of also reducing the 
edge signal. 



2.1.1 Smoothing Techniques 

One approach to edge detection is to convolve the image signal with a Gaus- 
sian and then to look for zeros in the Laplacian of the convolved output [7]. 
Convolution with a 2-D Gaussian has several convenient qualities[8]: 1) the 
kernel is circularly symmetric and therefore does not favor any direction 
a priori, 2) it is separable in x and y, 3) its Fourier transform is also a 
Gaussian, and 4) it can be approximated by a binomial series. 

The binomial approximation to convolution with a Gaussian is made by 
repeatedly convolving the image with the mask {1/2,1/2}. Performing a 
binomial convolution has several favorable attributes for hardware imple-
mentation. First, only local pixel access is required and second, scaling is
by factors of 2 which is more easily implemented in some hardware systems.
Each convolution with the {1/2,1/2} mask yields successive terms in the 
binomial series. 
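
The binomial approximation can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the function names `binomial_kernel` and `gaussian_samples` are illustrative and do not come from the memo. It builds the kernel by repeated convolution with the {1/2,1/2} mask and compares it with a sampled Gaussian of the same mean and variance.

```python
import numpy as np

def binomial_kernel(n):
    """Convolve the two-tap mask {1/2, 1/2} with itself n times (n + 1 taps)."""
    kernel = np.array([1.0])
    for _ in range(n):
        kernel = np.convolve(kernel, [0.5, 0.5])
    return kernel

def gaussian_samples(n):
    """Gaussian sampled at 0..n with the binomial kernel's mean n/2 and variance n/4."""
    k = np.arange(n + 1)
    sigma2 = n / 4.0
    return np.exp(-(k - n / 2.0) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)

kernel4 = binomial_kernel(4)      # the binomial row 1 4 6 4 1, scaled by 1/16
kernel16 = binomial_kernel(16)
max_error = np.max(np.abs(kernel16 - gaussian_samples(16)))
```

Every step uses only a neighboring sample and a division by 2, which is the property the text identifies as convenient for hardware.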

In practice, either type of convolution, Gaussian or its binomial approx- 
imation, is acceptable. Of course Gaussian convolution is much simpler to 
analyze theoretically; however, the accuracy of edges should not "make or 
break" a vision system (at least until proven otherwise). This results from 
the considerable confusion regarding what edges are optimal for recognition. 

A more general approach to smoothing that proves useful for hardware 
implementations is regularization[9]. The regularization formulation for vi-
sion seeks to minimize the error between the input signal and the output
signal subject to constraints. The constraints are designed to impose a priori 
assumptions about the nature of the solution. For example, with an input 
signal of light intensity and a smooth output signal desired, an appropri- 
ate constraint might be the gradient of the output. The following equation 
expresses this notion for a continuous output field, f(x) given input g(x). 

\[ E = \int \left[ \alpha\,(f - g)^2 + \lambda\,|\nabla f|^2 \right] dx \qquad (1) \]

The function, f(x), that minimizes E is sought. The first term in this
equation requires that f be close to the input data g; the second term
requires that f be smooth.

Finding the minimum of Equation 1 is a problem in variational calculus. 
The solution for f(x) is found by solving 

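\[ -\lambda \nabla^2 f + \alpha f = \alpha g \qquad (2) \]

For a 1-D problem with continuous data, the Fourier transform of Equation 2 is

\[ F(\eta) = \frac{\alpha}{\lambda \eta^2 + \alpha}\, G(\eta) \qquad (3) \]

where η is the angular spatial frequency and F and G are the Fourier
transforms of f and g. The regularized solution can be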
viewed as nothing more than a convolution of the input with a low-pass 
filter. 
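
The low-pass interpretation can be illustrated on sampled data. This is a sketch under stated assumptions, not the memo's method: the grid is periodic with unit spacing, the Laplacian is replaced by the second difference (so η² becomes 2 - 2cos(2πk/N)), and the names `regularize_fft` and `regularize_direct` are hypothetical. It compares the filter of Equation 3 with a direct solve of the discrete form of Equation 2.

```python
import numpy as np

def regularize_fft(g, lam, alpha):
    """Filter g with H = alpha / (alpha + lam * eta2) on a periodic grid.

    eta2 = 2 - 2 cos(2 pi k / N) is the discrete counterpart of eta^2.
    """
    n = len(g)
    eta2 = 2.0 - 2.0 * np.cos(2.0 * np.pi * np.arange(n) / n)
    H = alpha / (alpha + lam * eta2)
    return np.real(np.fft.ifft(H * np.fft.fft(g)))

def regularize_direct(g, lam, alpha):
    """Solve (-lam * D2 + alpha * I) f = alpha * g, D2 = periodic second difference."""
    n = len(g)
    eye = np.eye(n)
    D2 = -2.0 * eye + np.roll(eye, 1, axis=1) + np.roll(eye, -1, axis=1)
    return np.linalg.solve(-lam * D2 + alpha * eye, alpha * np.asarray(g, dtype=float))

rng = np.random.default_rng(0)
g = np.where(np.arange(64) < 32, 0.0, 1.0) + 0.1 * rng.standard_normal(64)  # noisy step
f_fft = regularize_fft(g, lam=4.0, alpha=1.0)
f_direct = regularize_direct(g, lam=4.0, alpha=1.0)
```

The two results agree to rounding error, and the output is smoother than the input because the filter gain never exceeds one.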

Formulating the vision problem as an energy minimization is natural for 
implementing the problem in a physical system[10, 11]. Physical systems, 
electrical, mechanical, etc, minimize the system's Lagrangian. For the case
of an electrical network the Lagrangian is simply the network's energy. If
the electrical network's energy is designed to duplicate a vision problem's 
energy, the network will solve the vision problem. 



2.1.2 Edge Detection 

Many edge detection techniques exist. Possibly the most studied and most
biologically relevant edge detector is Gaussian convolution followed by
application of the Laplacian operator[7]. The computation of the Lapla-
cian of a Gaussian (LOG) convolution can be performed or approximated
in several ways. One way is to first convolve with the Gaussian and to 
subsequently compute the Laplacian with one of its masks[12]. This is nat- 
ural for many digital systems. An approximation to the LOG is based on 
the biological "center-surround" receptor and entails simply subtracting the 
smoothed background, the surround, from the signal, the center. After the 
LOG calculation, edges are identified by the zero-crossings. Yet a third way,
based on the approximation to the LOG as a difference of Gaussians (DOG), 
is to convolve with two Gaussians and then to subtract the two. 

The DOG approximation to LOG convolution has been implemented in 
hardware[13]. The implementation exploits two observations: 1) the solu- 
tion to the diffusion equation is the convolution of the initial distribution 
with a Gaussian and 2) the voltages in a distributed resistive/capacitive 
transmission line obey the diffusion equation. The width of the Gaussian is 
a function of time so that the DOG is computed by sampling the voltages 
on the transmission line at two times and then subtracting them. 
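
The sample-at-two-times idea can be sketched in software. The example below is an illustration only: it assumes NumPy, stands in for the transmission line with a discrete diffusion step (the [1/4, 1/2, 1/4] average, with replicated end samples), and uses hypothetical names `diffuse` and `zero_crossings`. The DOG of a 1-D intensity step is formed from two diffusion times and the edge is read off from the zero-crossing.

```python
import numpy as np

def diffuse(signal, steps):
    """Apply `steps` iterations of the [1/4, 1/2, 1/4] average, a discrete diffusion step."""
    out = np.asarray(signal, dtype=float)
    for _ in range(steps):
        padded = np.pad(out, 1, mode="edge")
        out = 0.25 * padded[:-2] + 0.5 * padded[1:-1] + 0.25 * padded[2:]
    return out

def zero_crossings(values):
    """Indices i where values[i] and values[i + 1] have strictly opposite signs."""
    return [i for i in range(len(values) - 1) if values[i] * values[i + 1] < 0.0]

step_edge = np.where(np.arange(64) < 32, 0.0, 1.0)   # intensity step between pixels 31 and 32
dog = diffuse(step_edge, 2) - diffuse(step_edge, 8)  # sample at two "times" and subtract
edges = zero_crossings(dog)
```

The single sign change falls between pixels 31 and 32, where the step was placed.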

Discontinuity detection can be viewed as generalized edge detection. 
Edge detection seeks discontinuities in the light intensity; discontinuity de- 
tection seeks discontinuities in surface property data. Using the surface 
property data, such as depth from stereo, adds an additional complication 
since the data can be sparse. The surface property data is sparse because 
some early vision algorithms produce surface property data only at intensity 
edges. 



2.1.3 Smoothing with Discontinuity Detection 

One problem with the preceding edge detector analysis is that the smoothing 
process reduces precisely that signal from the differential operator needed 
to identify the edge. The edges themselves are smoothed away. To some 
extent this problem can be eliminated by combining the smoothing and 
edge detection processes[14, 11, 15, 16, 17]. These techniques smooth the 
data unless a discontinuity is detected. Smoothing is abandoned between 
locations separated by a discontinuity. The output is the smoothed data 
and the discontinuities in the data. The computation proceeds by finding 
the configuration of data and discontinuities that minimizes a function. An 
example function is shown below. 

\[ E_i = \sum_{j \in C_i} \{ (f_i - f_j)^2 (1 - l_{ij}) + \alpha (f_i - g_i)^2 + \beta V_C(l_{ij}) \} \qquad (4) \]

The variable f_i is the output data at site i; l_ij is the output discontinuity,
a binary value, separating sites i and j. The function is designed to impose
constraints of smoothness and continuity on the output data and discon-
tinuities. The function E_i is not quadratic and, consequently, stochastic
methods must be used to minimize E_i.
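
A small numerical case shows why a discontinuity can lower the energy. The sketch below makes simplifying assumptions that are not in the memo: the sites form a 1-D chain, each neighboring pair is counted once, and the discontinuity term is reduced to a fixed penalty β per broken link. The function name `energy` and the values of α and β are illustrative.

```python
import numpy as np

def energy(f, g, line, alpha, beta):
    """Total energy of a 1-D chain; line[i] = 1 breaks the link between sites i and i + 1."""
    f, g, line = (np.asarray(a, dtype=float) for a in (f, g, line))
    smoothness = np.sum((f[1:] - f[:-1]) ** 2 * (1.0 - line))
    data = alpha * np.sum((f - g) ** 2)
    penalty = beta * np.sum(line)
    return smoothness + data + penalty

g = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]   # input data with a unit step
no_break = [0, 0, 0, 0, 0]
one_break = [0, 0, 1, 0, 0]          # discontinuity between sites 2 and 3
e_unbroken = energy(g, g, no_break, alpha=1.0, beta=0.25)
e_broken = energy(g, g, one_break, alpha=1.0, beta=0.25)
```

With the output equal to the input, the unbroken chain pays the squared step height (1.0) while the broken chain pays only β (0.25), so marking the discontinuity is the lower-energy configuration whenever β is smaller than the squared jump.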



3 Analog VLSI 

This chapter describes the use of analog VLSI devices for vision work. 
Largely initiated by Caltech's Carver Mead[1], analog VLSI is now also used
by, among others, Christof Koch[18], also at Caltech. Another approach to 
analog computation of vision algorithms utilizes CCD technology[19, 20, 21]. 
However, within the scope of this paper, the CCD technology will not be 
analyzed. 

This chapter is divided into three sections. The first section contains a 
discussion of subthreshold CMOS devices and is followed by a section on 
hardware implementations of vision algorithms. The final section analyzes 
the analog VLSI's applicability for implementing a real-time vision system. 

Figure 2: An n-channel MOS device. P-channel devices are fabricated within
an n-well. The device parameters are presented in Section 3.1.2. 

3.1 Subthreshold CMOS 

The use of MOS devices in the subthreshold regime has been championed by 
Carver Mead[1]. For vision applications, the subthreshold regime is preferred
by Mead for three reasons: 1) the exponential dependence of the drain 
current as a function of gate voltage, 2) the low power usage in this regime, 
and 3) the near current source characteristic of the source-drain terminals 
for V_ds ≳ 100 mV. The following sections describe the basic device physics
of subthreshold MOS operation, outline the circuit model, and present some
of the limitations and advantages of these CMOS devices.



3.1.1 Device Physics 

Most MOS devices are not operated in the subthreshold region. Most texts,
in fact, call the drain current zero unless the gate voltage is above the
threshold voltage, while for gate voltages above this threshold the drain
current is linear or quadratic in the gate voltage. Figure 2 shows an
n-channel MOS structure and will be used during the description of MOS
operation.

When a positive voltage is applied to the gate, a "channel" forms just 
below the gate oxide in the p substrate. The channel is formed by expelling 
the majority carrier holes which leaves a depletion region of fixed acceptor
atoms. If the gate voltage is large enough, free-moving, minority-carrier
electrons can also occupy this depletion region. Both the fixed acceptor 
atoms and the induced, free electrons within this depletion region balance 
the positive charge on the gate electrode; the gate oxide acts like a capacitor. 

When the density of free electrons in the depletion region is much
less than the acceptor ion density, the MOS is in the subthreshold region.
If these free electrons are ignored and Gauss's law is applied to the ox- 
ide/substrate interface, there can be no electric field parallel to the surface 
and, therefore, the interface is an equipotential surface. Any free electron 
motion along the interface cannot be due to drift, only diffusion. To compute 
this diffusion current the source and drain voltages must be considered. 

The current due to diffusion is given by

\[ I = \frac{q w D}{l} (N_{sg} - N_{dg}) \qquad (5) \]

where q = electronic charge, w = transistor width, D = diffusion constant,
N_dg = electron density at the drain-gate region, and N_sg = electron density
at the source-gate region. l is the length of the channel. The gate surface po-
tential and the electron density of states provide the means to compute the
electron densities. The density of states for electrons is a Fermi distribution
but, far away from the Fermi energy, the distribution can be approximated
by a Boltzmann distribution. The resulting electron density is

\[ N = N_0 e^{q\psi/kT} \qquad (6) \]

where ψ is the gate surface potential. Combining this with Equation 5 the
drain current is

\[ I = I_0 e^{\kappa V_g/\beta} e^{-V_s/\beta} \left( 1 - e^{-V_{ds}/\beta} \right) \qquad (7) \]

where β = kT/q = 25 mV and

\[ I_0 = \frac{q w}{l} D N_0 e^{-\phi_0/kT}. \]

For small V_ds, the channel acts like a linear resistor. As V_ds increases the
channel gets pinched off near the drain. Further increases in V_ds pinch off
the channel completely and cause the channel to separate from the drain.
This separation reduces the effective length of the channel below l. With the
channel pinched off, variations in V_ds do not affect the electron density; the
channel current becomes largely independent of V_ds. The channel length
does depend on V_ds and this small dependence affects the drain current
and accounts for the slight slope in the I-V characteristics in the saturation
region.



3.1.2 MOS Specifications and Limitations 

The analog VLSI circuits are produced by the MOSIS foundry using 2 μm
technology. Most publications for vision applications of analog VLSI do
not reveal specific numbers characterizing the performance of these devices.
However, some data can be found[1] or inferred from the MOSIS design
specifications. The channel length l is about 1.5 μm; oxide thickness is 125 Å.
For a silicon dioxide gate, the capacitance is about 0.1 pF. The factor, κ,
may range from 0.55 to 0.73 with 0.7 being a typical value. The current I_0 is
approximately 1.5×10^-7 nA. Gate voltages, V_gs, for subthreshold operation
are generally between 0.3 and 0.8 volts with the corresponding drain currents
of 7×10^-4 and 8×10^2 nA respectively. (Note that some of the numbers may
seem inconsistent. I_0 was deduced from a device with κ = 0.676 ([1], page
38) but the drain currents were computed with κ = 0.7.) Normal threshold
for the MOS device is roughly V_gs > 1 Volt. The device behaves similarly to
a current source when in the saturated region for V_ds > 100 mV.
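
The quoted drain currents can be reproduced from the subthreshold law I = I_0 e^(κV_g/β) e^(-V_s/β) (1 - e^(-V_ds/β)). The following is a minimal sketch using the values given above (I_0 = 1.5×10^-7 nA, κ = 0.7, β = 25 mV); the function name `drain_current_na`, the grounded source, and the choice V_ds = 1 V for saturation are assumptions made for illustration.

```python
import math

BETA = 0.025     # kT/q in volts (25 mV)
KAPPA = 0.7      # typical value of the factor kappa quoted in the text
I0_NA = 1.5e-7   # pre-exponential current I_0 in nA

def drain_current_na(v_g, v_ds, v_s=0.0):
    """Subthreshold drain current in nA for gate, drain-source and source voltages in volts."""
    return (I0_NA * math.exp(KAPPA * v_g / BETA) * math.exp(-v_s / BETA)
            * (1.0 - math.exp(-v_ds / BETA)))

i_low = drain_current_na(0.3, 1.0)    # lower end of the subthreshold gate range
i_high = drain_current_na(0.8, 1.0)   # upper end of the subthreshold gate range
# Fraction of the saturated current already reached at V_ds = 100 mV.
saturation_fraction = drain_current_na(0.5, 0.1) / drain_current_na(0.5, 1.0)
```

The two currents come out near 7×10^-4 nA and 8×10^2 nA, and at V_ds = 100 mV the device already carries about 98% of its saturated current, which is the sense in which it behaves like a current source.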

The primary limitation of MOS devices operating in the subthreshold 
region arises from the inability to provide a consistent threshold voltage 
across the chip die. This is the problem of device mismatch. The threshold 
voltage is part of φ_0 in the previous section. Because of the exponential
dependence, small variations of φ_0 can introduce large variations in I_0. For
instance, if φ_0 = qV_0 and V_0 shifts by 10 mV, the variation in I_0 is nearly 33%. For
transistors that are physically close to one another, a typical variation is 
±20% [1] although variations of 100% may in fact be more representative[22] 
for device mismatch. 
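
The 33% figure follows from the exponential dependence alone. As a rough check, assuming I_0 is proportional to e^(-V_0/β) with β = kT/q = 25 mV (the helper name `relative_current_change` is hypothetical):

```python
import math

BETA = 0.025  # kT/q in volts

def relative_current_change(delta_v0):
    """Fractional change in I_0 when V_0 shifts by delta_v0 volts, with I_0 ~ exp(-V_0 / BETA)."""
    return math.exp(-delta_v0 / BETA) - 1.0

drop_10mv = relative_current_change(0.010)    # V_0 raised by 10 mV: about -0.33
rise_10mv = relative_current_change(-0.010)   # V_0 lowered by 10 mV: about +0.49
```

The asymmetry between the two directions is another consequence of the exponential: equal voltage offsets of opposite sign do not produce equal current errors.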

Computations that use differential amplifiers, such as derivatives, are 
sensitive to variations in I_0. Small differential signals may be overwhelmed
by the transistor mismatch and, consequently, the circuit design must min- 
imize this effect. Of course biological systems have device mismatch and 
these systems generally do fine. As a scientific endeavor, the mismatch is 
acceptable; as an engineering endeavor, the mismatch is problematic and
hampers development of a useful vision system.

Figure 3: Hexagonal lattice of pixels. Each pixel contains a photodetector
of area a, the black square, surrounded by local processing circuits. Each
pixel has an area of A (A > a).

3.2 MOS Vision Algorithms and Devices 

In this section two of the higher-level analog VLSI circuits will be presented. 
These circuits are the silicon retina and the resistive fuses. These circuits 
utilize some common elements, such as the phototransistor and the resis- 
tive network for smoothing, and a common layout structure. These shared 
structures are discussed first as background for the subsequent discussion of 
the two higher-level circuits. 

3.2.1 Common VLSI Structures 

Figure 3 shows the typical layout for the analog VLSI circuits. Each
circuit consists of a one or two dimensional grid of pixels. The inter-pixel
spacing is defined as L; the pixel area as A (= L²). Each pixel is comprised
of a phototransistor and additional local-processing circuitry. Each photo-
transistor has size l and area a (= l²). The area-fill-factor is η_a and is defined
as a/A. The number of pixels along each linear dimension is N. The two
dimensional grid is arranged as a hexagonal lattice by displacing alternate
rows by L/2.

Figure 4: a) The phototransistor circuit[1]. b) A pnp phototransistor device
fabricated with n-well CMOS.

The phototransistor circuit and device are shown in Figure 4. The pnp 
transistor has a photosensitive base region which produces a current at the 
collector that is proportional to the incident light intensity. This photocur-
rent is fed through one (or two) diode-connected p-channel MOS devices.
The output voltage for this circuit is proportional to the logarithm of the
photocurrent,

\[ V_{output} = V_{dd} - \frac{\beta}{\kappa} \ln\left( \frac{I}{I_0} \right), \]

where I is the current through the phototransistor's collector. The loga-
rithmic compression increases the usable range of incident light intensities.
Typically V_output is 1 to 2.5 Volts below V_dd. This corresponds to a current
range of roughly 5 orders of magnitude. The smallest detectable current is
about 10^-5 nA or 10^5 photons/second.

Figure 5: A resistive network.

With n-well CMOS, for the pnp phototransistor, the base is the well itself 
and the emitter is fabricated from a p-diffusion step. The p-type collector is 
the substrate (Figure 4b) and it is electrically grounded. The n-well process 
produces parasitic phototransistors whenever a well is deposited. To avoid 
unwanted photocurrents, the die is shielded everywhere except at the desired 
phototransistors. The second metal layer serves as the shield. 

Figure 5 shows the third and final common structure of this section: 
the resistive network[1]. The resistive network performs a smoothing oper-
ation useful for early vision and is used in both the silicon retina and the
discontinuity-detecting resistive fuses. The one-dimensional, continuous re-
sistive network solves for the minimum of Equation 1. With a resistivity
per unit length of ρ, a conductivity per unit length to ground of γ, and an
input voltage of g(x), Kirchhoff's current and voltage laws are:

\[ \frac{dI}{dx} = \gamma\,[V(x) - g(x)] \]

\[ \frac{dV}{dx} = \rho\, I(x) \qquad (8) \]

These two equations yield Equation 2 provided λ = 1 and α = ργ. The
Green's function is

\[ V(x|x_0) = \frac{1}{2L} \begin{cases} e^{(x - x_0)/L}, & x < x_0 \\ e^{-(x - x_0)/L}, & x > x_0 \end{cases} \qquad (9) \]

where the space constant (or "smoothing width") is L = 1/√(γρ). Equation 9
shows that a unit impulse at x = x_0 diffuses throughout the network with an
exponential fall-off. If a capacitance is added between V(x) and ground, the
network converges to a solution with a time constant of roughly τ = C/G.
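
The exponential fall-off can be reproduced with a lumped version of the network. This sketch is an illustration under stated assumptions: the continuous line is replaced by a ladder of discrete nodes, each joined to its neighbors by a resistance R and to its input by a conductance G, with open ends; NumPy is assumed, and `solve_ladder` and the component values are hypothetical.

```python
import numpy as np

def solve_ladder(g, r_horizontal, g_vertical):
    """Node voltages of a 1-D resistive ladder driven by inputs g.

    Kirchhoff's current law at node i:
        g_vertical * (V_i - g_i) + (2 V_i - V_{i-1} - V_{i+1}) / r_horizontal = 0
    (the two end nodes have a single neighbour).
    """
    n = len(g)
    laplacian = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
    laplacian[0, 0] = laplacian[-1, -1] = 1.0
    system = g_vertical * np.eye(n) + laplacian / r_horizontal
    return np.linalg.solve(system, g_vertical * np.asarray(g, dtype=float))

R, G = 1.0, 0.01                        # arbitrary units, chosen so 1/sqrt(R*G) = 10 nodes
impulse = np.zeros(201)
impulse[100] = 1.0                      # unit impulse at the centre node
v = solve_ladder(impulse, R, G)
decay_ratio = v[101:111] / v[100:110]   # a constant ratio means exponential fall-off
space_constant = -1.0 / np.log(decay_ratio[0])
```

The ratio between successive node voltages is constant on either side of the impulse, and the fitted space constant is close to 1/√(RG) = 10 nodes, the lumped counterpart of L = 1/√(γρ).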

Figure 6 shows an implementation of a resistor and its adjacent nodes for 
the resistive network[l]. The resistor is comprised of the transistors labeled 
Q\ and Q2 in Figure 6b. To find the small-signal resistance between V n and 
V n+ i, assume that nV gm - V m = Vj for m € {n,n + 1}. Transistors Q\ and 
Q2 of Figure 6b will be biased identically and the resulting current will be 

/ = V^tanh[ (K ~J ra+l) ]. (10) 

For small signals (x < 0.2), tanh(x) = x and the resistance is 

2/? 



R = 



I eW 



The resistance can be modified by varying the voltage Vj. This voltage Vd 
is the gate-to-source voltage of Qd shown in Figure 6a. For this circuit, the 
current mirror, QS - QA, keeps the currents through Ql, Q2, and Qd all 
equal to h/2. The diode connection at Q2 has voltage V n and consequently 

k = he {«v gn -v n )l0 = Ioe v d /^ 

Or, in terms of the bias voltage Vb, the resistive network's resistance is 

4/3 



R = 



J e*n//T 



This analysis assumes that all the transistors are well matched and operat-
ing in the saturation region. The measured I-V characteristic[1] for the hor-
izontal resistor shows that the small-signal assumption is valid for roughly
V_n - V_{n+1} = 100 mV with V_{n+1} = 2.5 V. Within this 100 mV range, the
resistance is linear and ranges between about 0.2 MΩ and 2×10^4 MΩ
(for 0.4 V < V_b < 0.8 V).

Figure 6: The resistive network[1]. a) The bias circuit for the nth network
node. V_n is connected to a node in the network and V_gn is connected to all
the transistors adjacent to the node. For a hexagonal lattice V_gn is attached
to 6 gates. b) The transistors Q1 and Q2 model a resistor between nodes n
and n + 1.

Equation 10 shows that the current between nodes V_1 and V_2 of Fig-
ure 6a saturates when |V_1 - V_2| is large compared with 2β. This is a crude
type of discontinuity detector. At saturation, the current is I = I_0 e^(V_d/β)
and the effective resistance approaches ∞. The two voltages across the
resistor are no longer related;
smoothing no longer occurs. A more sophisticated discontinuity detector is 
discussed in the subsequent section on resistive fuses. 
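
The two regimes of the horizontal resistor can be seen by evaluating the tanh law directly. This is a minimal sketch assuming the current takes the form I = I_sat tanh(ΔV/(2β)) with β = 25 mV; the 1 nA saturation current and the function names are illustrative values, not measurements from the memo.

```python
import math

BETA = 0.025  # kT/q in volts

def resistor_current(delta_v, i_sat):
    """Current through the horizontal resistor, I = I_sat * tanh(delta_v / (2 * BETA))."""
    return i_sat * math.tanh(delta_v / (2.0 * BETA))

def small_signal_resistance(i_sat):
    """Slope resistance at delta_v = 0, R = 2 * BETA / I_sat."""
    return 2.0 * BETA / i_sat

i_sat = 1e-9                                  # 1 nA saturation current (illustrative)
r_small = small_signal_resistance(i_sat)      # 5e7 ohms for this choice
i_linear = resistor_current(0.001, i_sat)     # 1 mV across the resistor: ohmic regime
i_clipped = resistor_current(0.5, i_sat)      # 0.5 V across the resistor: saturated
```

For a 1 mV difference the current matches ΔV/R closely, while for 0.5 V it is pinned at the saturation current, so a large voltage step between neighboring nodes is no longer smoothed in proportion to its size.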



3.2.2 Silicon Retina 

The retina is the first stage of the image processing that ultimately converts 
the image produced by the eye's optical system into moving, colorful, recog- 
nizable objects. The optical signal is converted to an electrical signal by the 
photoreceptors that line the back side of the retina. Subsequent layers of 
retinal cells: amacrine, horizontal, bipolar and ganglion, further process the 
electrical signal until the ganglion axons send the signal along to the lateral 
geniculate nucleus. Presumably, the different cell types are associated with 
different computations. Some of the ganglion cells may produce something 
similar to the convolution of a Laplacian of a Gaussian. The horizontal 
cells produce the surround region and the bipolar cells produce the center 
region. The ganglion cells produce a center-surround response by subtract- 
ing the bipolar and horizontal cell outputs. The amacrine cells respond to 
time-varying signals. The resulting computation yields those regions in the
image that change spatially or temporally. 

The silicon retina[23] is an attempt to duplicate, at Marr's computational 
and algorithmic levels, the simplified retina described above. A phototran- 
sistor imitates, in an approximate way, the human retina's photoreceptor 
response. A center-surround algorithm is used by the silicon retina to find 
spatially varying regions. The horizontal resistive network provides the sur- 
round region and a differential amplifier subtracts this from the photorecep- 
tor response. The resulting signal is clocked off the retina for display. 

Figure 7 provides an overview of the silicon retina circuitry. Most of
the elements have been described previously. The phototransistor circuit
provides a voltage for the follower-connected differential amplifier. This
amplifier acts like a conductance to couple the phototransistor voltage to
the resistive network node. The conductance is determined by voltage V_g
and has a range similar to the reciprocal of the horizontal resistance. The
resistive network couples the different pixels by a resistance determined by
V_r of Figure 5.

Figure 7: The silicon retina[1]. The output is the difference between the
input voltage and the smoothed output of the resistive network. The two
differential amplifiers shown (biased by V_b and V_g) are transconductance
amplifiers[1].

Several aspects of the silicon retina are variable. The smoothing width 
is determined by L of Equation 9 which, for the exponential dependence
of R and G on gate voltage, is controlled by the difference of voltages V_g
(Figure 7) and V_r (Figure 6). Typically L ranges between 0.1 and 10.0
pixels. The time response τ is controlled by G and thus V_g as well as the
fixed capacitance C. The capacitance, as mentioned in Section 3.1.2, is
about 0.1 pF. Given the range of G the network's time response can be
varied between about 1 ms and 10 ns or 1 kHz to 100 MHz (other capacitance
probably makes this an unrealizable speed). The smoothing width and time
response are independently variable. 
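
The quoted range of time responses is consistent with τ = C/G. As a rough check, assuming the conductance G spans the reciprocal of the horizontal resistance range given earlier (0.2 MΩ to 2×10^4 MΩ) and C = 0.1 pF; the constant and function names are illustrative:

```python
C_NODE = 0.1e-12             # node capacitance in farads (0.1 pF)
R_MIN, R_MAX = 0.2e6, 2e10   # resistance range in ohms (0.2 Mohm to 2x10^4 Mohm)

def time_constant(capacitance, conductance):
    """tau = C / G for a node of capacitance C coupled through conductance G."""
    return capacitance / conductance

tau_fast = time_constant(C_NODE, 1.0 / R_MIN)   # about 20 ns
tau_slow = time_constant(C_NODE, 1.0 / R_MAX)   # about 2 ms
```

These values agree in order of magnitude with the 10 ns to 1 ms range stated in the text.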

The differential amplifier at the output computes the difference between
the phototransistor voltage and the smoothed resistive network voltage.
This is the center-surround computation. The output from the amplifier
is enabled by the voltage $V_b$.
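
The center-surround computation can be illustrated in software with a
one-dimensional version of the network. This is a minimal sketch, not a
model of the actual chip: it assumes ideal linear resistors, a steady
state, and a dimensionless parameter `rg` equal to R times G. The function
names are hypothetical.

```python
def smooth_network(v_in, rg):
    """Steady-state node voltages of a 1-D resistive network (len(v_in) >= 2).

    Node i is tied to its input v_in[i] through conductance G and to its
    neighbors through resistance R; rg = R * G. Kirchhoff's current law
    gives (rg + deg_i) * v[i] - sum(neighbor v) = rg * v_in[i], a
    tridiagonal system solved here with the Thomas algorithm.
    """
    n = len(v_in)
    # deg_i is 1 at the two ends of the chain and 2 elsewhere.
    diag = [rg + (1 if i in (0, n - 1) else 2) for i in range(n)]
    rhs = [rg * v for v in v_in]
    # Forward elimination; every off-diagonal entry is -1.
    for i in range(1, n):
        w = -1.0 / diag[i - 1]
        diag[i] += w
        rhs[i] -= w * rhs[i - 1]
    # Back substitution.
    v = [0.0] * n
    v[-1] = rhs[-1] / diag[-1]
    for i in range(n - 2, -1, -1):
        v[i] = (rhs[i] + v[i + 1]) / diag[i]
    return v

def center_surround(v_in, rg):
    """Input minus its smoothed (surround) version, as at the retina output."""
    return [a - b for a, b in zip(v_in, smooth_network(v_in, rg))]

step = [0.0] * 8 + [1.0] * 8          # an intensity edge
out = center_surround(step, rg=1.0)
print(" ".join(f"{x:+.2f}" for x in out))
```

The output is zero for uniform input and swings negative then positive
across the edge. A smaller `rg` widens the smoothing region.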

The silicon retina contains 48 x 48 pixels. Each pixel comprises the
phototransistor, the circuitry shown in Figure 7, and the resistive network.
A pixel is roughly 100 µm x 100 µm and the phototransistor occupies 10% of
the pixel area. Besides the circuitry for each pixel, the silicon retina contains
devices to access each pixel's output current. Any one pixel's response can
be observed over time or each pixel can be sequentially clocked out for video
display. A single pixel's intensity, time, and edge responses are qualitatively
similar to measurements made on biological retinas [23]. When each pixel
is sequentially clocked out, each pixel should reach equilibrium within the
30 Hz frame period. If the retina is scaled from 48 x 48 to 512 x 512, the
speed of each pixel's computation, as determined primarily by the time
response of the resistive network, need not increase. Only the sequential
clocking circuitry must be sped up.
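
The required speed-up of the clocking circuitry follows from simple
arithmetic. The sketch below assumes every pixel is read once per frame
and ignores any blanking intervals; the function name is hypothetical.

```python
FRAME_RATE_HZ = 30

def pixel_clock(rows, cols, frame_rate=FRAME_RATE_HZ):
    """Return (pixel rate in Hz, time per pixel in seconds) for serial readout."""
    rate = rows * cols * frame_rate
    return rate, 1.0 / rate

for n in (48, 512):
    rate, period = pixel_clock(n, n)
    print(f"{n} x {n}: {rate / 1e6:.3f} MHz, {period * 1e9:.0f} ns per pixel")
```

The readout rate grows from roughly 69 kHz to roughly 7.9 MHz, while each
pixel still has a full frame period to settle.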

3.2.3 Resistive Fuses

Resistive fuses [24] attempt to solve the important problem of
discontinuity detection [14]. The function of a fuse is to prevent
smoothing between neighboring sites. Once broken, the fuse marks
the location of the discontinuity and prevents further smoothing between
the neighboring sites. The smoothing is avoided by greatly increasing the
resistance between the sites.

Equation 1, embodying the smoothing problem, is modified by the addition
of discontinuities. In discrete form the energy for smoothing with
discontinuities is:

    $E_i = \sum_j \{ (f_i - f_j)^2 (1 - l_{ij}) + \alpha (f_i - g_i)^2 + \beta l_{ij} \}$    (11)

Here $g$ is the surface property data, an input; $f$ and $l$ are the smoothed
surface property data and the discontinuities respectively, the outputs. The
total energy is the sum of $E_i$ for all sites $i$. The field $l$ is a binary
field so that when $l_{ij} = 1$ the first term in Equation 11 contributes
nothing to the energy and the third term contributes $\beta$. $\beta$ is the
penalty for turning on a line. As a function of $f_i - f_j$, the minimum of
$E_i$ is quadratic with $l_{ij} = 0$ until $(f_i - f_j)^2 = \beta$, where
$l_{ij} = 1$ and $E_i = \beta$. A similar dependence can be implemented in
analog VLSI.
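
The quadratic-then-constant behavior can be seen by minimizing the
interaction term of Equation 11 over the binary line variable for a single
pair of sites. This is a small illustrative sketch; the function name and
the value of beta are arbitrary choices.

```python
def pair_energy(delta, beta):
    """Minimum over l in {0, 1} of delta**2 * (1 - l) + beta * l.

    Returns (energy, l) for one pair of neighboring sites whose smoothed
    values differ by delta.
    """
    unbroken = delta ** 2   # l = 0: quadratic smoothing cost
    broken = beta           # l = 1: fixed penalty for turning on a line
    return (unbroken, 0) if unbroken <= broken else (broken, 1)

beta = 0.25
for delta in (0.0, 0.2, 0.5, 0.6, 2.0):
    energy, line = pair_energy(delta, beta)
    print(f"delta={delta:.1f}  E={energy:.2f}  l={line}")
```

The energy grows as delta squared until delta reaches the square root of
beta and is constant at beta beyond that point.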

The previous analysis showed that when $\Delta f > \sqrt{\beta}$ the energy is
constant; prior to that point, the energy is quadratic. A fuse has just that
property. In analog VLSI, a fuse is implemented by making the voltage $V_b$
of Figure 6b a function of the voltage difference between nodes in the
resistive network. When the voltage difference is larger than some threshold,
for an ideal fuse $R = \infty$ and $V_b$ should be 0 V. An approximation to
this has been implemented [24] and is shown in Figure 8. For this circuit,
the fuse current is



    $I_{fuse} = \ldots \tanh(\ldots)$    (12)



The current Ib determines the resistance for smoothing; the current Ia 
determines when the resistance breaks (really, begins to break). Both are 
adjustable but not separately for each node in the circuit. 

Figure 8: A resistive fuse implementation [24].

Figure 9: A resistive fuse network [24].

These fuses have been used in an eight-node network. Figure 9 is a block
diagram for the network. Each of the eight input voltages $d_i$ is variable
and each smoothed output voltage $f_i$ is accessible. The conductances $g_i$
are analogous to $\alpha$ of Equation 11 and are a measure of the expected
noise or "trust" of the input data $d_i$. Each of the $g_i$ is variable and,
when the $d_i$ are sparse, some of the $g_i$ will be zero. The currents $I_b$
and $I_a$ are controlled by voltages $V_b$ and $V_a$, respectively. The
network has been shown to smooth data and break the smoothing to mark
discontinuities [24]. A two-dimensional circuit with 400 nodes is in
development.
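
The behavior of such a network can be imitated in software. The sketch
below is an idealization, not a model of the circuit in [24]: it assumes
ideal (abrupt) fuses, a single data term per node with the same weight
alpha everywhere, and a simple alternating heuristic rather than a global
minimizer. The eight input values and the parameters are made up for
illustration.

```python
def fuse_network(d, alpha, beta, sweeps=200):
    """Smooth 1-D data d with ideal fuses between neighboring nodes.

    Heuristically lowers
        sum_i alpha * (f[i] - d[i])**2
      + sum_i [(f[i] - f[i+1])**2 * (1 - l[i]) + beta * l[i]]
    by alternating two steps:
      1. set l[i] = 1 (fuse blown) when (f[i] - f[i+1])**2 exceeds beta;
      2. one Gauss-Seidel sweep of the smoothing equations.
    """
    n = len(d)
    f = list(d)
    l = [0] * (n - 1)
    for _ in range(sweeps):
        l = [1 if (f[i] - f[i + 1]) ** 2 > beta else 0 for i in range(n - 1)]
        for i in range(n):
            total, weight = alpha * d[i], alpha
            if i > 0 and not l[i - 1]:          # left fuse intact
                total, weight = total + f[i - 1], weight + 1.0
            if i < n - 1 and not l[i]:          # right fuse intact
                total, weight = total + f[i + 1], weight + 1.0
            f[i] = total / weight
    return f, l

d = [0.0, 0.1, -0.1, 0.05, 1.0, 0.9, 1.1, 1.0]   # noisy step, eight nodes
f, l = fuse_network(d, alpha=0.5, beta=0.25)
breaks = [i for i, broken in enumerate(l) if broken]
print("smoothed:", [round(x, 3) for x in f])
print("fuse broken between nodes:", [(i, i + 1) for i in breaks])
```

With these values the noise on each side of the step is smoothed while the
single fuse across the step stays broken. Choosing alpha and beta for
unsupervised use is the hard part, as discussed in Section 3.3.2.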



3.3 Discussion 

One motivation driving Mead's work appears to be the desire to study
biological systems by building analogous systems in hardware. Building these
hardware systems serves two complementary roles [1] (page 8). The systems
attempt to provide computational neuroscientists with a facility that allows
experimental verification of the neuroscientists' hypotheses. Additionally,
development of these hardware systems attempts to provide insight into the
properties of collective systems. These are the main issues guiding the
research on analog VLSI.

Yet, from a computational neuroscientist's view, hardware systems do 
not provide the required flexibility for algorithm development. So far, the de- 
sign of the hardware has been guided by the results from the computational 
neuroscientists; not the other way around. The hardware implementations 
are approximations to the computational theory. Discrepancies between 
hardware results and theory reveal the inadequacy of the hardware imple- 
mentation and representation. The discrepancies have not been attributed 
to the computational theory. The analog hardware does not seem to have 
satisfied the goal of providing neuroscientists with an experimental facility. 

As more tools, techniques and experience develop, analog VLSI may
eventually contribute to the computational theory. These developments,
expressed as a set of VLSI "standard cells" or modules, may allow the
hardware designer to rapidly modify an algorithm, thereby reducing
development time. The resistive network may be an example of an emerging
standardized module. Another impediment to analog VLSI's contribution may
be price. Until these impediments are circumvented, the primary
tool of computational neuroscientists will remain software simulation.

These impediments are not the major factors limiting the use of analog 
VLSI in a general vision system. The fundamental problems that ultimately 
limit analog VLSI's usefulness in vision are discussed in Section 3.3.3. 

The biological focus of analog VLSI systems hinders engineering of a
robust vision system. A biological model may demand local processing,
which nature implements with three-dimensional circuitry but which silicon
currently restricts to two dimensions. Two-dimensional local processing
scales very poorly as computational requirements and pixel densities
increase. Another hindrance, motivated by biologically acceptable power
consumption, is the use of subthreshold MOS devices. In the subthreshold
regime, owing to the exponential dependence of drain current on gate
voltage, MOS devices are more difficult to manufacture with uniform
properties. Consequently noise issues must be confronted. These examples
of engineering deficiencies are discussed in more detail below.



3.3.1 Adaptive Retina 

Although framed as a need to adapt the silicon retina to different light lev- 
els, the adaptive retina[25] is really an attempt to eliminate the problems 
of differential offset in the subthreshold MOS devices[26]. The mismatch 
between MOS device parameters proves particularly disruptive when com- 
puting derivatives. The simple differential amplifier can have a current- 
mirror with currents differing by 100% [1] and a 20% difference is common. 
Such differences can easily confound the center-surround computation of the 
silicon retina. 

Figure 10 is a schematic of the adaptive retina. The adaptation serves
to counteract the effect of mismatched devices. During adaptation, the
adaptive retina chip is illuminated with a uniform light intensity, the
resistive network is set to compute a global average, and the
phototransistor emitter and the floating gate [27] are exposed to UV
radiation. Under these conditions, the UV radiation allows a small current
to flow through the silicon-dioxide insulator between the floating gate and
the phototransistor's emitter. This current charges the floating gate so as
to reduce the surface potential of the p-channel within the floating-gate
MOS transistor. Once equilibrium is reached, $V_{output}$ will be near the
node voltage of the resistive network and hence equal for all pixels.

The adaptive retina turns out to exhibit properties similar to biological
systems. Like biological retinas, the silicon retina can adapt to different
light levels and also displays "after-image" phenomena [25]. One cannot
argue with the results. Not unexpectedly, a biological approach reproduced
biological results. Such results do not necessarily bring a working vision
system nearer to reality.



3.3.2 Practical Resistive Fuse 

For an analog VLSI circuit such as the resistive fuse, the primary goal has, 
once again, not been to develop a working vision system. Seemingly, and 
for good reason, the focus has been on understanding vision and vision algo- 
rithms. Some of the unanswered questions in vision and, particularly, inter- 
mediate vision are of a fundamental nature. In the case of the discontinuity- 
detecting resistive fuses, questions regarding parameter specification are ex- 
ceedingly difficult and remain the major impediment to further development. 
The resistive-fuse system works under supervision when the parameters can 
be controlled; however, unsupervised, success is unlikely until parameter 
estimation issues are resolved. 

The primary benefit resulting from resistive fuse hardware may be the 
speed which might allow exploration of the parameter space. Yet, in pa- 
rameter estimation, the need to quickly modify the algorithm may show 
such a hardware approach to be ill-suited. The advent of a chip integrat- 
ing intensity edges with surface property data to detect discontinuities [26] 
may achieve more success. When using intensity edges to guide the search 
for discontinuities in surface properties, the specification of parameters is 
significantly less critical[5, 28]. 



3.3.3 A Vision System in Analog VLSI? 

Most likely, a general working vision system in hardware will require two
attributes: lots of pixels and lots of computation. These two attributes are
lacking in the present analog VLSI implementations and are addressed by
the issue of scaling. Scaling refers primarily to increasing the total
number of pixels and increasing the processing associated with each pixel.
Large numbers of pixels are required to do anything more than crude
recognition and navigation, and increased processing power is necessary when
edge detection, stereo, motion, and discontinuity detection must all be
performed.

As shown in Figure 3, analog VLSI technology positions the processing
locally with each pixel. As the amount of required processing increases, the
fractional area $\eta_A$ occupied by the phototransistor diminishes and, for
the same number of pixels, the chip die size increases. Optical resolution
and efficiency are both degraded when the local processing circuitry
increases. Already, $\eta_A$ is significantly smaller than in current CCD
technology. For the silicon retina, $\eta_A = 0.1$ (roughly); for the
resistive fuses, $\eta_A$ is even less. Both of these analog VLSI systems are
very low level. Once circuitry for stereo and motion is added, as well as
the processing needed by intermediate vision, the optical performance may be
reduced to unacceptable levels.

An analog VLSI layout designed to model more than one early vision 
module with two-dimensional local-processing would be highly non-modular. 
As currently formulated in computational vision theory, each of the individ- 
ual vision modules requires the pixels to be locally interconnected. This 
interconnectivity is designed to impose the smoothness constraint on
surface property data. With several vision modules implemented at each pixel,
the interconnection layout may be prohibitively complicated. A modular
approach would have a chip (or separate wafer region) for each vision module.
Phototransistor or silicon retina output could be shared by all the chips.

When the number of pixels is increased (and/or the local processing re- 
quirements increase), designers of analog VLSI hardware must address the 
issues of wafer scale integration and, consequently, fault tolerant design. Bi- 
ological systems have largely resolved both these issues. However, in circuit 
design, these issues are far from resolved and consequently vision hardware 
with analog VLSI must await further developments. In addition, wafer scale 
integration further increases the design time and device cost. 
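
The pressure toward wafer-scale integration can be made concrete with the
silicon retina's own numbers. The sketch below assumes the pixel pitch
stays at roughly 100 µm and the fill factor at 10% as the array grows, and
it ignores pads and readout circuitry; both function names are
hypothetical.

```python
PIXEL_PITCH_UM = 100.0   # pixel edge length quoted for the silicon retina
FILL_FACTOR = 0.10       # fraction of each pixel used by the phototransistor

def array_edge_mm(pixels_per_side, pitch_um=PIXEL_PITCH_UM):
    """Edge length in mm of a square pixel array."""
    return pixels_per_side * pitch_um / 1000.0

def photo_area_mm2(pixels_per_side, pitch_um=PIXEL_PITCH_UM, fill=FILL_FACTOR):
    """Total light-sensitive area in mm^2."""
    return fill * array_edge_mm(pixels_per_side, pitch_um) ** 2

for n in (48, 512):
    print(f"{n} x {n}: {array_edge_mm(n):.1f} mm per side, "
          f"{photo_area_mm2(n):.1f} mm^2 light-sensitive")
```

At a fixed pitch, a 512 x 512 array is about 51 mm on a side, far beyond an
ordinary die, and any added per-pixel circuitry increases the pitch
further.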

3.3.4 Review

The previous sections have detailed several of the disadvantages and suc- 
cesses of analog VLSI for vision. The silicon retina has been successful in 
duplicating, qualitatively, many of the characteristics of biological retinas. 
The adaptive retina successfully addressed the problem of MOS mismatch 
in the silicon retina and yielded "after image" effects similar to biological 
systems. At the computational theory level, both these devices as well as the 
resistive fuse and the stereo correlator[29] produced results consistent with 
theory. These systems used low-power, subthreshold MOS devices almost 
exclusively. 

On the negative side, several problems with both analog VLSI and the 
design methodology for analog VLSI will hinder development of a vision 
system. In the subthreshold regime, MOS devices are difficult to match 
and, consequently, they demand robust circuit design. Redesign of analog 
VLSI circuits is difficult, because, due to its newness, analog VLSI does not 
have a standard set of circuit modules. With time both these problems may 
be reduced. 

A significant problem with analog VLSI systems is the adherence to a 
local processing layout. As the local computation requirements increase, 
the optical resolution and response are reduced since the phototransistors 
occupy a smaller fraction of the pixel and are spaced further apart. Also, 
when additional vision modules are implemented in hardware, the local pro- 
cessing requirement reduces the modularity of the system. Finally, as the 
number of pixels increases, the circuits demand a larger portion of the silicon 
wafer. With larger wafer size, point defects will increase and fault tolerant 
circuits must be designed. These factors increase the cost, complexity and 
development time of analog VLSI systems. 



4 Digital Circuits for Vision 

General-purpose digital computers, serial and parallel, are used to
implement vision algorithms. Because they are general-purpose computers,
much of the hardware in these machines is not related to the specifics of
vision tasks. The result is a computer with lots of flexibility and very
little speed. Although a research environment may not need real-time
processing capability, most applications could benefit from vision
algorithms running in real-time.

The processing required for real-time vision computations is immense.
For a 512 by 512 image at 30 Hz the serial processing rate is nearly 10 MHz.
A processing rate of this magnitude cannot be sustained on a serial computer
designed for general-purpose use. Even for a massively-parallel Connection
Machine operating simultaneously on every element of the image, achieving
a 30 Hz rate is difficult. Such a rate may be obtained on a fully configured
Connection Machine (64k processors) running an assembly-language coded
version of an edge detector. Using a Connection
Machine whenever a real-time vision task is required is beyond ridiculous.

Specialized hardware for vision may enable real-time computation. The
previous section detailed the use of analog techniques for vision. This
section examines an example of digital techniques for vision [2]. It
reviews the previous work, its goals and philosophy, and presents some of its
algorithms and circuits. The 3x3 convolver chip is discussed in detail,
as are the problems and inadequacies of the digital approach to vision.



4.1 Goals and Approach 

The development of the image processing IC system[30] was guided by four 
major goals. From the standpoint of a vision researcher, the most important 
goal was that the system be able to perform image recognition on two- 
dimensional images. Another goal was that the system operate in real-time 
so that additional hardware such as frame buffers would not be required. In
addition, the design of the vision system should utilize modules that perform 
fundamental vision algorithms. In this way, the modules can be readily 
configured to solve different problems. The final goal was that development 
time be minimal. 

In order to satisfy these goals, a strict hierarchy for the hardware design 
was employed. Each level in the hierarchy would contain those circuits 
required by one or more modules of a higher level. In this way duplicate 
design could be eliminated provided that the circuits could be generalized. 
Design at a higher level would entail primarily interfacing the
"building-block" modules from the lower level. This design hierarchy also
serves to reduce development time by utilizing identical modules at each
level.

The highest level in the hierarchy is the image processing task to be 
performed. Examples are the recognition task or image enhancement. These 
tasks are performed by combining chips such as clocks and buffers with the 
basic image-processing chips from the second level. 

The second level in the hierarchy contains the chips. Each chip is a 
complete functional block that performs a low-level vision task. Examples 
of chips are linear and logical convolvers, look-up tables, sorting functions, 
and contour tracers. Most of these chips accept a stream of input and 
perform the necessary delays required by two-dimensional image processing. 
Delays of 512 are required to align the rows of a 512 x 512 image when 
convolving spatially. 

The third level is composed of macrocells. Macrocells are the blocks that 
compose the chips. Common functions include storage elements (RAMs, 
ROMs and line delays), bit-sliced data paths, and controllers. To speed the
design, several tools automatically lay out or assemble the bit-sliced data
paths and program the ROM/PLAs.

The lowest level is composed of registers, adder cells, and ROM cells. 

The images for the recognition system were obtained from a 512 by 512 
"broadcast quality" video image with a frame rate of 30 Hz. Figure 11 shows 
an overview of the recognition system. Edge detection is the first stage in 
the image processing and, since it is the only processing step comparable 
to analog VLSI, the discussion will focus on it. The pattern matching and 
feature extracting stages are highly dependent on the assumption of two- 
dimensional images. This assumption is not consistent with a general vision 
system and as a consequence these two stages are not discussed here. The 
other processing step, contour tracing, although somewhat skew from the 
discussion of edge and discontinuity detection, is presented here briefly. 



4.2 The Chips 

The edge detection and contour tracing routines of Ruetz [30] are computed
with several chips. The edge detection is performed by first smoothing the
input data with a low-pass filter to eliminate noise, high-pass filtering with
a threshold to identify the edges, and then "bloating" and subsequently
"eroding" the edges to produce closed contours. Each of these operations
can be performed by convolving the input with a suitable mask. A 3x3 linear
convolver chip was developed to perform the low- and high-pass filtering. The
"bloating" and "eroding" were performed by a 7x7 logical convolver chip. In
addition a contour tracing chip was developed. These chips are discussed in
the following sections.

Figure 12: The 3x3 Linear Convolver [2].



4.2.1 Convolution / Filtering 

The convolver chip performs a real-time convolution on a 512 pixel / 512 line 
image with a 3 x 3 mask. The chip can be cascaded so that, for the binomial 
convolution discussed in Section 2.1.1, any binomial series can be produced. 
Besides binomial convolution, several other types of low-pass filters[30] can 
be utilized since the convolver chip has programmable coefficients. 

Figure 12 shows a block diagram for the convolver chip. For all pixels
$f_{ij}$ in the image and for the FIR $h_{ij}$, the convolver chip computes

    $g_{ij} = \sum_{n=1}^{3} \sum_{m=1}^{3} h_{nm} f_{i-n,j-m}$    (13)

The convolution is performed not by shifting the image data; rather, the
coefficients $h_{ij}$ are shifted. Each of the accumulators (ACCs) (Figure 12,
bottom) computes one complete convolution and, upon receipt of the "done"
signal, outputs the result. The results appear at the clock rate with each
accumulator's output delayed by one clock cycle ($z^{-1}$) from the previous
accumulator. The ACCs perform three additions, one for each line in the
convolution, to obtain the convolved result.

The arithmetic controller arranges for each row of three multiplying
accumulators (MACs) to compute one line in the convolution. The line delay
macrocells, $z^{-L}$, delay the image data by one line (512 pixels) for each
separate row of MACs. The controller cycles through the coefficients for one
row. For instance, the top row of MACs always computes the top row of the
convolution since the coefficients appear as
$\{h_{11}, h_{12}, h_{13}, h_{11}, \ldots\}$. Similarly the bottom row of MACs
always computes the bottom row of the convolution, $h_{3j}$,
$j \in \{1,2,3\}$. The MACs within each row see the same data but
have the coefficients $h_{ij}$ delayed by one clock cycle. Like the ACCs, upon
receipt of the "done" signal, a MAC outputs its result, which is subsequently
summed by an ACC. The MACs within a row produce an output at one
third the pixel data rate.
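
The role of the line delays can be illustrated with a software model of a
streaming convolver. This is a sketch of the general technique only: it
shifts pixel data through small registers, whereas the chip described
above circulates the coefficients, and it makes no attempt to model the
MAC and ACC timing. The function name is hypothetical.

```python
from collections import deque

def stream_conv3x3(pixels, width, h):
    """Convolve a raster-scanned pixel stream with the 3 x 3 mask h.

    Two line delays, each `width` pixels long, make the current pixel and
    the pixels one and two lines above it available at the same time, so
    no frame buffer is needed. Yields sum_{n,m} h[n][m] * f[y-n][x-m]
    (n, m in 0..2) for every position whose window lies inside the image.
    """
    delay1 = deque([0] * width, maxlen=width)   # one-line delay
    delay2 = deque([0] * width, maxlen=width)   # second line delay
    rows = [deque([0] * 3, maxlen=3) for _ in range(3)]  # 3-pixel registers
    for t, p in enumerate(pixels):
        one_up = delay1[0]      # pixel directly above the current one
        two_up = delay2[0]      # pixel two lines above
        delay2.append(one_up)
        delay1.append(p)
        rows[0].append(two_up)
        rows[1].append(one_up)
        rows[2].append(p)
        y, x = divmod(t, width)
        if y >= 2 and x >= 2:
            yield sum(h[n][m] * rows[2 - n][2 - m]
                      for n in range(3) for m in range(3))

width, height = 5, 4
img = [[(3 * y + 5 * x) % 7 for x in range(width)] for y in range(height)]
binomial = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]   # unnormalized binomial mask
stream = [p for row in img for p in row]
print(list(stream_conv3x3(stream, width, binomial)))
```

Each output needs only the two stored lines and the three-pixel registers,
which is why the hardware can keep pace with the incoming video.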



Line Delay The function of the line delay is to accept one pixel and
output the pixel delayed by one video line. Figure 13 shows the line delay
architecture. The delay is implemented by shifting a pointer to the data
rather than moving the data itself. Eight consecutive pixels of eight bits
each are de-multiplexed to fill one 64-bit register. This register is then
stored in a 63 x 64 bit RAM at, say, location n. The location n is
incremented by 1 at one-eighth of the 10 MHz pixel data rate. Simultaneous
with writing the input register at site n, site n + 1 in the RAM is read into
the 64-bit output register. 8-bit chunks from the 64-bit output register are
latched onto the output line. This implements the 512 pixel delay.
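
The pointer-based delay can be sketched as a circular buffer. For clarity
this simplified model stores one pixel per memory slot; the chip packs
eight pixels into each 64-bit word so that the RAM cycles at one-eighth of
the pixel rate. The class name is hypothetical.

```python
class LineDelay:
    """Fixed delay implemented by moving a pointer, not the data."""

    def __init__(self, length=512):
        self.ram = [0] * length   # one slot per pixel of delay
        self.n = 0                # pointer to the oldest stored pixel

    def step(self, pixel):
        """Accept one pixel and return the pixel from `length` steps ago."""
        delayed = self.ram[self.n]               # read the oldest entry
        self.ram[self.n] = pixel                 # overwrite it with the newest
        self.n = (self.n + 1) % len(self.ram)    # advance the pointer
        return delayed

delay = LineDelay(512)
out = [delay.step(p) for p in range(1, 1101)]
print(out[510:515])   # zeros until the first stored pixels emerge
```

Only the pointer changes on each step, so no stored pixel is ever copied
from one location to another.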

The line delay architecture has several advantages. The multiplexing
serves to reduce the 63 x 64 RAM rate to 1.25 MHz (for a 10 MHz video
signal). Consequently, lower power devices can be used. Also, the RAM
is rectangular, which makes laying out the line delay macrocells with the
convolver MACs and ACCs simpler.

Figure 13: The line delay chip [2].



4.2.2 Dilation 



Dilation and erosion are two computations performed on 1-bit edge maps. 
Dilation transforms a solitary one in the edge map into a region of ones. 
After the dilation, previously isolated ones may be connected to one another. 
This is one approach to filling gaps in detected edges. After the thickening,
the edge is then eroded until a thin contour is obtained. Ruetz[30] has 
developed a 7x7 logical convolver chip to perform this task. 



The operation of dilation can be expressed as below.

    $g_{x,y} = \bigvee_{n,m} ( h_{n,m} \wedge f_{x-n,y-m} )$    (14)

$h_{n,m}$ is the 7x7, 1-bit dilation mask. This mask is simply ANDed with
the delayed, 1-bit input data $f$. Figure 14 shows a block diagram of the
logical convolver. The seven 1-bit delay lines are formed from the 8-bit
delay line discussed previously by connecting the output on line n to the
input on line n + 1. The output from each of the 7 delay lines is stored in a
7-bit shift register. These 49 bits are then ANDed with the pre-stored mask
and ORed together to obtain the final result.

Figure 14: The 7x7 logical convolver chip [2].
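
The AND/OR structure of the logical convolver maps directly to code. The
sketch below is an illustrative software version, not the chip's
implementation: it indexes the mask from its center, treats pixels outside
the image as zero, and uses a 3x3 mask instead of 7x7 to keep the example
small.

```python
def dilate(f, h):
    """Binary dilation: g[x][y] = OR over (n, m) of h[n][m] AND f[x-n][y-m].

    f is a list of rows of 0/1 values; h is a square 0/1 mask with odd side
    length, indexed here so that its center corresponds to offset (0, 0).
    """
    rows, cols = len(f), len(f[0])
    r = len(h) // 2
    g = [[0] * cols for _ in range(rows)]
    for x in range(rows):
        for y in range(cols):
            for n in range(-r, r + 1):
                for m in range(-r, r + 1):
                    xs, ys = x - n, y - m
                    if 0 <= xs < rows and 0 <= ys < cols:
                        g[x][y] |= h[n + r][m + r] & f[xs][ys]
    return g

ones3 = [[1, 1, 1]] * 3
edge = [[0, 0, 0, 0, 0],
        [0, 1, 0, 1, 0],   # two edge pixels separated by a one-pixel gap
        [0, 0, 0, 0, 0]]
for row in dilate(edge, ones3):
    print("".join(str(v) for v in row))
```

After dilation the two isolated pixels merge into a single connected
region, which is the gap-filling effect described above.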



4.2.3 Contour Tracing 

Labeling contours is a common vision processing task. Once the contours in 
an image have been labeled, the contours can be broken into features such 
as straight lines, arcs, and corners and the length of the contour can be 
determined. Several recognition schemes require labeled edges[31, 32] and 
considerable effort has been spent on efficient contour following and label- 
ing for the Connection Machine[33]. Most image contours are not simply 
biconnected and often will contain T-junctions and X-junctions[34]. 

Contour tracing is the first step for labeling contours. Ruetz simplifies 
the tracing problem by making several assumptions: each image contains one 
and only one contour, the contour is closed, and the contour is biconnected 
(no T or X junctions). With these assumptions a very simple, finite-state 
algorithm[35] for tracing can be employed[30]. 

Figure 15 shows the architecture for the contour tracing chip. For
real-time computation, the entire edge map must be buffered and,
consequently, the 512x512 image was down-sampled to 128x120 pixels. The
decimation function for the down-sampling could not be determined. To begin
the tracing, the controller searches for the first non-zero pixel by stepping
through X and Y coordinates. Once this pixel is found, the controller checks
pixels in its neighborhood in a deterministic order. The order ensures that,
independent of contour direction, the contour will be traced. Once a
neighboring pixel is found with a non-zero value, its $\delta X$ and
$\delta Y$ offsets are noted and the process repeats. With the starting (X,Y)
position known, the series of offsets determines the path of the contour.
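
A tracer of this kind can be sketched in a few lines. The code below is
not the finite-state algorithm of [35], whose details are not given here;
it is a simple stand-in that relies on the same assumptions (one closed,
non-branching contour) and keeps a set of visited pixels. The neighbor
order is an arbitrary illustrative choice.

```python
# Neighbor offsets (dy, dx) checked in a fixed order: the four edge-adjacent
# neighbors first, then the diagonals.
NEIGHBORS = [(0, 1), (1, 0), (0, -1), (-1, 0),
             (1, 1), (1, -1), (-1, -1), (-1, 1)]

def trace_contour(edge_map):
    """Trace a single closed, non-branching contour in a binary edge map.

    Returns (start, offsets): the first non-zero pixel in raster order and
    the list of (dy, dx) steps taken from it. Returns (None, []) if the map
    contains no edge pixels.
    """
    rows, cols = len(edge_map), len(edge_map[0])
    start = next(((y, x) for y in range(rows) for x in range(cols)
                  if edge_map[y][x]), None)
    if start is None:
        return None, []
    visited = {start}
    offsets = []
    y, x = start
    while True:
        for dy, dx in NEIGHBORS:
            ny, nx = y + dy, x + dx
            if (0 <= ny < rows and 0 <= nx < cols
                    and edge_map[ny][nx] and (ny, nx) not in visited):
                visited.add((ny, nx))
                offsets.append((dy, dx))
                y, x = ny, nx
                break
        else:
            return start, offsets   # no unvisited neighbor remains

ring = [[1, 1, 1, 1],
        [1, 0, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1]]
start, offsets = trace_contour(ring)
print(start, offsets)
```

The start position plus the offset list reproduces the contour, which is
the compact representation the chip passes on to later stages.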

4.3 Advanced Chips for Image Processing

Since his work while at Berkeley, Ruetz has moved to LSI Logic Corp. which 
now produces several advanced image processing chips. The chips include 3 
x 3 and 8x8 8-bit convolvers with a programmable FIR, line delay chips with 
variable delays of 512, 1024 or more, binary filters and template matchers, 
and rank-value filters. Some of these chips are briefly discussed below.

The Multi-bit Filter (L64240) from LSI Logic Corp. can perform two- 
dimensional convolution with an 8 x 8 window size at 20 MHz. For an 8 x 8 
window, the input is 8 8-bit streams; the output is a 40-bit convolution over 
the window. The FIR coefficients are individually programmable. Inputs are 
provided to allow cascading of the chips, to facilitate increasing the window 
size, and to manipulate streams with more than 8-bits. In addition, the 
window shape can be reconfigured to 1 x 64, 2 x 32, and 4 x 16, and the
output can be scaled or delayed as desired. The chip price is roughly $1,300. 

Another chip is the Variable-Length Video Shift Register (L64211). This
chip takes an 8-bit input stream at 20 MHz and produces up to 8 8-bit 
outputs. Each output is delayed from the previous output by a length that 
is programmable between 12 and 516 pixels. This chip provides a means to 
shift the serial video signal by individual scan lines thereby providing the 
two-dimensional configuration for convolution. The cost is $115. 

Figure 16 shows two possible configurations of 8 x 8 convolution chips
and line delay chips to produce a 16 x 16 convolution. Another approach
that avoids the expensive 8 x 8 convolvers is to cascade many 3 x 3
convolvers to build up the required window size. The cascading scheme uses
fewer convolver chips for the same window size but is limited to masks that
are the convolution of smaller masks and introduces a longer overall time
delay. In addition, multiple scale output is available when chips are
cascaded.

The Binary Filter and Template Matcher (L64230) is analogous to the 7 
x 7 logical convolver discussed previously. This chip can perform the dilation 
and erosion computations in addition to pattern matching at 20 MHz. The 
chip can be configured as a 32 x 32 mask that requires 32 1-bit input lines. 
The output is 16 bits. 

The fourth chip is the Rank-Value Filter (L64220). This filter sorts the
pixels in an 8 x 8 (4 x 16, 2 x 32, or 1 x 64) size window and returns the
pixel value of a specified rank. A typical use would be as a median,
minimum, or maximum filter. The inputs and output are 12 bits and the chip
can operate at up to 20 MHz.

Figure 16: Two Multi-Bit Filter chip configurations for 16 x 16 convolution
windows. a) A fully programmable 16 x 16 window with 40 bit output. b)
Cascading two filters produces a 16 x 16 window from the convolution of
the two 8 x 8 windows.

The L64720 (MCP) Video Motion Compensation processor computes the
correlation between 16 x 16 (or 8 x 8) data blocks from two video images.
For 16 x 16 data blocks, the L64720 achieves 30 Hz performance on 352
x 288 images. The data blocks are correlated over an offset range of -8
to 7 pixels in both x and y. Additional devices can be used to increase
the correlation offset range. This processor can implement the stereo [36]
and motion [37] algorithms currently used on the Vision Machine [38]. The
correlation is computed for every offset between the two data blocks, and
the offsets in x and y that minimize the sum of the absolute values of
the differences between the data block pixels are returned. The cost for the
L64720 is roughly $200.
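
The matching criterion described above, the sum of absolute differences
(SAD) evaluated at every offset, can be written as an exhaustive search.
The sketch below is a plain software illustration of that criterion and
makes no claim about how the L64720 organizes the computation; the function
names and the synthetic test pattern are invented for the example.

```python
def best_offset(ref, cur, y0, x0, block=16, lo=-8, hi=7):
    """Exhaustive block matching by sum of absolute differences (SAD).

    Compares the block of `ref` whose top-left corner is (y0, x0) with the
    block of `cur` displaced by (dy, dx), for every offset in [lo, hi], and
    returns (dy, dx, sad) for the smallest SAD. The caller must keep every
    displaced block inside `cur`.
    """
    best = None
    for dy in range(lo, hi + 1):
        for dx in range(lo, hi + 1):
            sad = sum(abs(ref[y0 + j][x0 + i] - cur[y0 + dy + j][x0 + dx + i])
                      for j in range(block) for i in range(block))
            if best is None or sad < best[2]:
                best = (dy, dx, sad)
    return best

def pattern(y, x):
    """Synthetic image content that does not repeat within the search range."""
    return (7 * x + 13 * y) % 251

size, shift_y, shift_x = 40, -2, 3
ref = [[pattern(y, x) for x in range(size)] for y in range(size)]
# `cur` is `ref` displaced by (shift_y, shift_x).
cur = [[pattern(y - shift_y, x - shift_x) for x in range(size)]
       for y in range(size)]
print(best_offset(ref, cur, 16, 16, block=8))
```

The search recovers the displacement applied to the second image, with a
SAD of zero at the matching offset.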

LSI Logic sells two chips that are more general and useful than the
contour following chip described by Ruetz[30]. One chip is a Histogram / 
Hough Transform Processor; the other chip implements contour tracing. The 
contour tracing chip can find all contours in an image. Output includes the 
slope (over 2 pixels) and curvature (over 3 pixels) of the contour as well as the 
"object" position, perimeter, and area. With the addition of external RAM, 
the contour tracer can process binary images measuring 1024 x 1024 pixels. 
Rectangular subregions of the image can be scanned to speed processing 
when, for instance, an object is being tracked. This contour following chip 
could serve as a preprocessor for line and curve finding algorithms and, 
ultimately, a recognition algorithm. 



4.4 Discussion 

For vision research, the primary inadequacy of Ruetz's work is the
limitation to two-dimensional objects; the vision problem solved is
simplistic. As currently formulated, the general problem of vision requires
modules that analyze scene surface properties such as depth and motion.
These modules help determine object boundaries when images are
complicated. The restriction by Ruetz to two dimensions makes his recognition
problem solvable because the limited domain allows the use of simplifying
assumptions.

The object edges found by Ruetz's edge detector are noiseless. This
lack of noise is not a consequence of an efficient, optimized edge detector;
rather the input image quality is so high that the edges produced are perfect.
Ruetz shows examples of recognition results with images containing a single
object. Each image has one and only one contour. Although T-junctions
and X-junctions exist before the final processing stage, the final edge map
does not contain any such junctions and the contour is closed. This is not a
realistic case, but it greatly simplifies the contour following and feature
extraction process.

Ruetz's contour tracing chip immediately follows the edge detector. This 
is possible because the edges are noiseless. In a more sophisticated system, 
additional processing might be required, such as contour grouping and con- 
necting, before the contour tracer would have a connected sequence of pixels. 

Still, several aspects of Ruetz's work are significant. The 3x3 linear con- 
volvers and line delay circuits are generally useful for early vision tasks. 
Many early vision problems are formulated to use local processing in or- 
der to increase the parallelism. While pixel-serial digital techniques are not 
highly parallel, the convolvers can perform the local processing demanded 
by the vision algorithms. For these digital chips, accuracy is not a prob- 
lem. The chips use 8-bit data with accuracy no worse than most software 
implementations where 8 bits are used. Another significant aspect is the 
ability to cascade the convolvers, thereby computing, for instance, binomial 
convolutions of any order. Unfortunately, approximating a Gaussian with a 
standard deviation of 8 by cascading 3 x 3 binomial convolutions with 
coefficients {1/4,1/2,1/4} requires 127 such convolutions. 
(Subsampling the image can significantly reduce the number of convolutions 
required to approximate a Gaussian with a large standard deviation.) 
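
The cost of cascading can be estimated with a simple variance argument. 
The sketch below is illustrative only and is not taken from Ruetz's work: 
it assumes that each pass of {1/4,1/2,1/4} adds a variance of 1/2 (in 
squared pixels) and that variances add under cascading. The function names 
and the subsample parameter are hypothetical. Under these assumptions the 
estimate for a standard deviation of 8 is 128 passes, within one of the 
figure quoted above; the exact count depends on how the approximation is 
defined. 

```python
import math


def binomial_passes(sigma, subsample=1):
    """Estimate how many {1/4, 1/2, 1/4} passes approximate a Gaussian.

    sigma is the target standard deviation in full-resolution pixels.
    subsample is a hypothetical decimation factor applied beforehand, which
    divides the standard deviation needed on the coarser grid.
    """
    variance_per_pass = 0.5  # (-1)^2 * 1/4 + 0^2 * 1/2 + 1^2 * 1/4
    target_variance = (sigma / subsample) ** 2
    return math.ceil(target_variance / variance_per_pass)


def cascade(passes):
    """Return the 1-D kernel from cascading {1/4, 1/2, 1/4} `passes` times."""
    base = [0.25, 0.5, 0.25]
    kernel = [1.0]
    for _ in range(passes):
        out = [0.0] * (len(kernel) + 2)
        for i, k in enumerate(kernel):
            for j, b in enumerate(base):
                out[i + j] += k * b
        kernel = out
    return kernel


def kernel_variance(kernel):
    """Variance of a normalized, symmetric 1-D kernel about its centre."""
    centre = (len(kernel) - 1) / 2
    return sum(w * (i - centre) ** 2 for i, w in enumerate(kernel))


if __name__ == "__main__":
    print(binomial_passes(8))               # full-resolution estimate
    print(binomial_passes(8, subsample=4))  # after 4x subsampling
    print(kernel_variance(cascade(8)))      # variance after 8 passes
```

The same estimate shows why subsampling helps: reducing the resolution by a 
factor of 4 reduces the required variance, and hence the number of passes, 
by a factor of 16. 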



5 Analysis 

Current analog vision processing has proceeded by mapping algorithms onto 
available VLSI circuit elements such as MOS transistors. Implementing a 
resistive network requires implementing a resistor. The resistor and its bias 
circuitry are made from MOS transistors as Figure 6 illustrates. Edge 
detection requires a resistive network and amplifier for a center-surround 
computation. Discontinuity detection requires fuses. But ultimately everything 
is made from transistors. What is really needed is a novel hardware device 
that directly computes early vision tasks. This then would truly become 
"vision hardware" just as the bipolar and horizontal cells are specialized 
for vision. 

The question of local versus remote processing must be addressed at 
some point. Currently the analog VLSI developments at Caltech have uti- 
lized strictly local processing and, as was shown, these analog circuits scale 
poorly as the computational requirements increase. Eventually the optical 
quality will reach unacceptable levels with further increases in the 
computational requirements and, consequently, the computational circuitry 
must be removed from the photoreceptor portion of the chip. Further, by 
removing the computational circuitry, special-purpose optical detectors, such 
as CCDs, could be employed. 

Note that relocation of the processing to a remote location is precisely 
what evolution has provided for biological systems[39]. The early 
computational cells (horizontal cells, bipolar cells, etc.) do not 
significantly reduce 
the resolution of the optical system. The resolution is primarily affected at 
the optic disk where the resolution is zero. The retina can maintain high 
resolution by utilizing three dimensions for the computational cells; inte- 
grated circuits have not been successful at utilizing three dimensions yet. 
The visual system for humans uses a large portion of the brain, yet the 
photoreceptors use only a tiny fraction. The visual cortex is at the back of 
the brain; if there were no remote processing, so that computation was 
constrained to be strictly local, our eyes would quite literally be located 
on the back of our heads. Under this assumption, photoreceptors sparsely 
distributed throughout the visual cortex would clearly lead to unacceptable 
optical properties. 

Remote processing was successfully implemented in the digital vision 
processing. The CCD imager was independent from the computational 
chips. Cascading some 3x3 convolvers to modify the image smoothing would 
not mandate modifying the CCD imager and its associated optical system. 
The technology that allows for the remote processing is the circuit speed. 
The circuits are fast enough to allow real-time processing even when serial- 
izing the image data. If more computation is required, the digital chip speed 
may not be fast enough to maintain real-time processing. The alternatives are 
to increase chip speed or begin to parallelize the digital computation. Note 
that the convolver chips already perform nine or more multiplications in 
parallel. Additional parallelism may be obtained from simultaneous compu- 
tations on two or more pixels. Analog technology does not have a monopoly 
on parallel computation. 
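
The speed requirement for a fully serialized image can be made concrete 
with a rough time budget per pixel. The sketch below is an illustration 
under stated assumptions, not a description of the actual chips: the frame 
rate of 30 frames per second is assumed here rather than taken from the 
text above, blanking intervals are ignored, and the function name and the 
pixels_in_parallel parameter are hypothetical. 

```python
def time_per_pixel_ns(rows, cols, frames_per_second, pixels_in_parallel=1):
    """Time available per clock step, in nanoseconds, for real-time operation.

    Assumes every pixel of every frame passes through the pipeline and that
    `pixels_in_parallel` pixels are handled on each clock step.
    """
    pixels_per_second = rows * cols * frames_per_second
    return 1e9 * pixels_in_parallel / pixels_per_second


if __name__ == "__main__":
    # Fully serialized 512 x 512 image at an assumed 30 frames per second.
    print(time_per_pixel_ns(512, 512, 30))
    # Processing two pixels at once doubles the time available per step.
    print(time_per_pixel_ns(512, 512, 30, pixels_in_parallel=2))
```

Under these assumptions a serial pipeline has roughly 127 ns per pixel, and 
each doubling of the number of pixels processed simultaneously doubles that 
budget. 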

Other than fully serializing the data, as digital techniques currently do, 
remote processing is difficult to achieve. Consider two alternatives for 
a 512 x 512 imager: 1) process the image scan lines in parallel but serialize 
within a scan line[21], and 2) process all pixels in parallel. For the first 
alternative, 512 analog output lines are required. This number of output 
lines is well beyond present packaging technology. (Although, as Yang[40] 
has demonstrated, 512 analog output lines may not be needed. The remote 
processing and area fan-out requirements can be achieved on the CCD chip. 
The binomial smoothing circuits are located adjacent to the CCD imager 
and occupy at most 25% of the die. This is an N-parallel, pipelined 
architecture as compared to the N x N parallel architecture of Mead and the 
serial, pipelined architecture of Ruetz.) The second alternative requires 512^2 
analog output lines and is even more difficult to build. The technology that 
allows remote processing for these two cases is unrelated to speed. The need 
to access all the output lines is the dominant difficulty and consequently 
packaging technology plays a prominent role. 
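
The packaging burden of each alternative can be tabulated directly. The 
following sketch is illustrative: the scheme names are hypothetical labels 
for the three architectures discussed above, and the single output line 
assumed for the fully serial case is a simplification. 

```python
def analog_output_lines(n, scheme):
    """Output lines needed to move an n x n image off the imager chip."""
    if scheme == "serial":
        # Fully serialized data (assumed to leave on a single line).
        return 1
    if scheme == "line_parallel":
        # Scan lines in parallel, pixels serialized within each line.
        return n
    if scheme == "pixel_parallel":
        # Every pixel has its own output line.
        return n * n
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    for scheme in ("serial", "line_parallel", "pixel_parallel"):
        print(scheme, analog_output_lines(512, scheme))
```

For a 512 x 512 imager this gives 512 lines for the first alternative and 
262,144 lines for the second. 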



5.1 Parallel Hardware for Remote Processing 

This final section presents some ideas related to packaging for parallel com- 
putation. The packaging must be designed to allow for remote processing 
of a large number of analog signals. Remote processing avoids the following 
problems inherent in strictly local processing: 1) reduction of optical 
resolution and efficiency, 2) non-modularity and non-expandability requiring 
complete chip redesign, and 3) the need for fault-tolerant design for 
wafer-scale integration. 

Ideally the imager should contain a large number and high density of 
photoreceptors. Assume that the imager is designed with criteria at the 
limit of processing technology. Define the area of a single photoreceptor to 
be a. At the limit of processing technology only a few devices can fit within 
the area a either on the imager or on another chip. Consequently, before any 
processing can be performed there must be an area fan-out. For example, 
if area a can fit 4 MOS devices but an edge detector requires 20 devices, a 
fan-out of 5 is required. 
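
The fan-out arithmetic can be written as a one-line calculation. This is a 
minimal sketch; the function and parameter names are hypothetical, and it 
assumes the fan-out is rounded up to a whole multiple of the photoreceptor 
area a. 

```python
import math


def area_fan_out(devices_required, devices_per_photoreceptor_area):
    """Smallest whole multiple of area a that holds the required devices."""
    return math.ceil(devices_required / devices_per_photoreceptor_area)


if __name__ == "__main__":
    # An edge detector needing 20 devices where area a holds 4 MOS devices.
    print(area_fan_out(20, 4))
```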

One approach to remote processing is to provide a pin on the chip carrier 
for every pixel or, if only lines are being parallelized, every line. For either 
situation there are too many pins with too high a density at too low a power. 
The size of the pixels requires that the wiring be VLSI. 

The 3-D computer[41] suggests one possible scheme for remote process- 
ing. The 3-D computer stacks chips (each chip performs one processing task) 
using a 3-D wiring scheme. Each chip contains an array of N x N identical 
processing elements. Once stacked, each processing element in the array is 
connected to all the elements above and below it in the stack. This layout 
produces N x N stacks all working in parallel. 

Microbridge interconnections^] are used to stack the chips in the 3-D 
computer. The connections are made by tunneling through the chip sub- 
strate to the backside of the chip. The tunnel is highly doped to enable 
current transport between the chip circuits on the surface and the sub- 
strate's backside. A metal contact is made to the tunnel on the backside of 
one chip and the circuit-side of another chip. The two chips are then set 
on top of one another with the metal contacts touching. The microbridge 
interconnections are repeated throughout the array of processing elements 
and amongst all the chips. 

The problem with this scheme is that there is no area fan-out. The 
processing is similar to that between the photoreceptor and the ganglion cells 
of the retina (at the fovea) rather than similar to the processing between the 
ganglion cells and the visual cortex. For certain calculations, like binomial 
convolution, no fan-out is required if each chip in the stack performs one 
convolution. Still, this 3-D computer does not meet the criteria for remote 
processing. 

Another scheme for remote processing might be the multichip module[42] 
with a "flip-chip" imager. Figure 17 shows the multichip layout. The mul- 
tichip itself is a wafer containing numerous flip-chip sites. Each flip-chip 
site contains an array of pads where metallic contact will be made between 
the flip-chip and the multichip module. Within the multichip are numerous 
levels of metallic leads separated by interlevel dielectrics. The dielectrics 
ensure that all metallic levels are electrically isolated. The metallic leads are 
arranged on the multichip to make connections between flip-chip pads. As 
many as 33 metallized layers have been fabricated with upwards of 12,000 
chip pads[43]. This is the mechanism allowing remote processing with area 
fan-out. 

Figure 17: Multichip Modules with Flip Chip 

Figure 17 also shows a proposal for an imager. The imager is a photo- 
transistor array grown on a silicon dioxide substrate. The phototransistors 
are on the top of the substrate but, since silicon dioxide is transparent, light 
stimulates the phototransistor by passing through the substrate. Metallic 
contacts are placed on the top of the phototransistors; the chip is flipped 
and, possibly, soldered to the pads on the multichip. Although not applied to 
an imager, flip-chips have been produced with 16,000 pads on a 128 x 128 
array with solder bumps of 25 µm and inter-bump spacing of 60 µm[44]. 

The processing chips are arranged around the imager. The pads within 
any processing chip can have a density less than the pad density of the 
imager thereby allowing for area fan-out in the computation. 

Note how this multichip scheme satisfies the requirements for remote 
processing. Modularity: the flip-chips can be individually designed, fabricated 
and tested. The pad pattern must be standardized. Fan-out: the available 
computational area grows with each multichip module. Wafer-scale 
integration is not required for the complex processing and imaging chip. Only 
the relatively simple multichip layout needs wafer-scale techniques. Optical 
resolution is maintained. Some additional losses may exist due to substrate 
losses in the imager. 

Ultimately techniques must be developed for remote processing if real- 
time vision systems are to be produced. Already the work of Carver Mead 
has highlighted some of the deficiencies ahead if local processing is adhered 
to. Developments in remote processing may require a long-term commit- 
ment to research but should pay off with biologically relevant, modular, and 
highly-parallel devices. 

Digital techniques have achieved some success for real-time, 2D vision 
systems and, as an outgrowth, several useful chips, such as convolvers and line 
delays, are available. These chips can address some of the problems faced by 
early vision algorithms. Currently chip counts and cost may be high (for, 
say, binomial convolutions); however, the costs are dramatically lower than 
for similar functionality (if available) in analog VLSI. Future development in 
digital image processing of convolvers with larger masks may help reduce 
chip costs and counts. Questions remain regarding the use of digital circuits 
for intermediate vision tasks. Still, digital circuits appear to be the best 
choice for implementation of vision algorithms in hardware today. 

Continuing research in computational vision should remain committed 
to general-purpose software systems. Hardware apparently will remain too 
inflexible, limiting, and costly for rapid development of computational 
theory. However, for some cases, such as parameter estimation with resistive 
fuses, the computational requirements are so immense that special-purpose 
hardware, analog or digital, may illuminate the computational questions. 
Research in remote processing schemes should be undertaken concurrently 
with computational vision research to ensure a future, real-time hardware 
system for vision. 